I’m really excited about this course. Here are a couple of reasons why.
“Life is a continuous learning path.”
Check out my IODS-project on GitHub!
str(lrn2014[-1])
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.75 2.88 3.88 3.5 3.75 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(lrn2014[-1])
## [1] 166 7
head(lrn2014[-1], 10)
## gender Age attitude deep stra surf Points
## 1 F 53 3.7 3.750 3.375 2.583333 25
## 2 M 55 3.1 2.875 2.750 3.166667 12
## 3 F 49 2.5 3.875 3.625 2.250000 24
## 4 M 53 3.5 3.500 3.125 2.250000 10
## 5 M 49 3.7 3.750 3.625 2.833333 22
## 6 F 38 3.8 4.875 3.625 2.416667 21
## 7 M 50 3.5 4.250 2.250 1.916667 21
## 8 F 37 2.9 3.500 4.000 2.833333 31
## 9 M 37 3.8 4.500 4.250 2.166667 24
## 10 F 42 2.1 4.000 3.500 3.000000 26
ggpairs(lrn2014[-1], mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# An alternative with base graphics: p2 <- pairs(learning2014, col = learning2014$gender)
ggplot(lrn2014, aes(x = Age, y = Points, col = attitude)) + geom_point() + facet_grid(. ~ gender) + ggtitle("Age, Attitude and Gender versus Course Points")
ggplot(lrn2014, aes(x = Age, y = Points, col = attitude)) + geom_point() + facet_grid(. ~ gender) + stat_smooth(method = "lm", col = "red") + ggtitle("Age, Attitude and Gender versus Course Points")
ggplot(lrn2014, aes(x = Age, y = Points, col = attitude)) + geom_point() + geom_smooth(method = "lm") + ggtitle("Age, Attitude and Gender versus Course Points") + facet_grid(. ~ gender)
From the summary of the regression model below we can deduce that age and male gender are negatively associated with course points, while attitude has a clearly positive effect on course points.
my_model2 <- lm(Points ~ Age + gender + attitude, data = lrn2014)
summary(my_model2)
##
## Call:
## lm(formula = Points ~ Age + gender + attitude, data = lrn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.42910 2.29043 5.863 2.48e-08 ***
## Age -0.07586 0.05367 -1.414 0.159
## genderM -0.33054 0.91934 -0.360 0.720
## attitude 3.60657 0.59322 6.080 8.34e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
Here are some diagnostic plots for checking the model assumptions: Residuals vs Fitted values, Normal Q-Q plot and Residuals vs Leverage.
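These plots can be produced with base R’s plot() method for lm objects; a minimal sketch, assuming the my_model2 fit defined further below:

```r
# which = c(1, 2, 5) selects Residuals vs Fitted, Normal Q-Q and Residuals vs Leverage
par(mfrow = c(2, 2))
plot(my_model2, which = c(1, 2, 5))
```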
Here we read the joined student alcohol consumption data into R (remember that the path to the .csv file must be relative for it to work on the remote server!).
alc <- read.table("data/alc_mod.csv", header = TRUE, sep = ",")
alc_orig <- alc
The combined dataset has 382 observations (rows) of the 35 variables shown below.
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
## [1] 382 35
The data includes the following variable types: factor, integer, numeric and logical. I used the pipe operator to pass the output of the head() function (which prints the number of observations given as its second argument) to the str() function, which shows the structure of the data.
head(alc, 0) %>% str()
## 'data.frame': 0 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS":
## $ sex : Factor w/ 2 levels "F","M":
## $ age : int
## $ address : Factor w/ 2 levels "R","U":
## $ famsize : Factor w/ 2 levels "GT3","LE3":
## $ Pstatus : Factor w/ 2 levels "A","T":
## $ Medu : int
## $ Fedu : int
## $ Mjob : Factor w/ 5 levels "at_home","health",..:
## $ Fjob : Factor w/ 5 levels "at_home","health",..:
## $ reason : Factor w/ 4 levels "course","home",..:
## $ nursery : Factor w/ 2 levels "no","yes":
## $ internet : Factor w/ 2 levels "no","yes":
## $ guardian : Factor w/ 3 levels "father","mother",..:
## $ traveltime: int
## $ studytime : int
## $ failures : int
## $ schoolsup : Factor w/ 2 levels "no","yes":
## $ famsup : Factor w/ 2 levels "no","yes":
## $ paid : Factor w/ 2 levels "no","yes":
## $ activities: Factor w/ 2 levels "no","yes":
## $ higher : Factor w/ 2 levels "no","yes":
## $ romantic : Factor w/ 2 levels "no","yes":
## $ famrel : int
## $ freetime : int
## $ goout : int
## $ Dalc : int
## $ Walc : int
## $ health : int
## $ absences : int
## $ G1 : int
## $ G2 : int
## $ G3 : int
## $ alc_use : num
## $ high_use : logi
Next I’ll present my hypotheses about the relationships between high/low alcohol use and four other variables I chose from the data: sex, study time, internet use and the final grade for the course.
Sex and high alcohol use vs. final grade
## Source: local data frame [4 x 4]
## Groups: sex [?]
##
## sex high_use count mean_grade
## <fctr> <lgl> <int> <dbl>
## 1 F FALSE 156 11.39744
## 2 F TRUE 42 11.71429
## 3 M FALSE 112 12.20536
## 4 M TRUE 72 10.27778
This summary indicates that high alcohol use affects the mean final grade of males more than that of females.
From the barplot above we can also see that high alcohol use and sex are somehow connected.
The boxplots above indicate that high alcohol use has a more negative effect on the final grades of males than on those of females. The mean final grade of heavy-drinking men drops considerably compared with that of heavy-drinking women.
A different, point-based visualization of high alcohol use, final grade and sex indicates the same: men’s alcohol use affects the final grade more negatively than women’s.
Producing summary statistics by group
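A sketch of how such a grouped summary can be produced with dplyr, assuming the alc data frame from above (the column names count and mean_grade match the output below):

```r
library(dplyr)

# group by internet access and high alcohol use,
# then count the observations and average the final grade G3
alc %>%
  group_by(internet, high_use) %>%
  summarise(count = n(), mean_grade = mean(G3))
```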
## Source: local data frame [4 x 4]
## Groups: internet [?]
##
## internet high_use count mean_grade
## <fctr> <lgl> <int> <dbl>
## 1 no FALSE 43 10.883721
## 2 no TRUE 15 9.933333
## 3 yes FALSE 225 11.897778
## 4 yes TRUE 99 10.939394
The summary statistics above indicate that internet at home and low alcohol use have a positive effect on the final grade.
It looks like internet at home has no statistically significant effect on the use of alcohol.
Low alcohol use and internet at home seem to have some positive effect on the final grade for females.
Producing summary statistics by group
## Source: local data frame [8 x 4]
## Groups: studytime [?]
##
## studytime high_use count mean_grade
## <int> <lgl> <int> <dbl>
## 1 1 FALSE 58 10.84483
## 2 1 TRUE 42 10.38095
## 3 2 FALSE 135 11.75556
## 4 2 TRUE 60 10.70000
## 5 3 FALSE 52 12.42308
## 6 3 TRUE 8 13.00000
## 7 4 FALSE 23 12.30435
## 8 4 TRUE 4 12.50000
The above summary statistics revealed something quite surprising: the combination of long study time and high alcohol use seems to produce better final grades than other combinations.
The barplot above indicates that for women the highest study time, and for men the second highest, corresponds to the proportionally lowest rate of high alcohol use.
Next we’ll use glm() to fit a logistic regression model with high_use as the target variable and sex, internet, studytime and the final grade (G3) as the predictors. Then we’ll print out the summary of the model and finally the coefficients of the model.
The summary of the model
The fitted logistic regression model below shows the following.
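The fit itself can be sketched as follows (the object name m is my own choice; the formula matches the Call in the output below):

```r
# logistic regression: high_use explained by sex, internet, studytime and G3
m <- glm(high_use ~ sex + internet + studytime + G3, data = alc, family = "binomial")
summary(m)
```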
##
## Call:
## glm(formula = high_use ~ sex + internet + studytime + G3, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4103 -0.8419 -0.6579 1.1690 2.2315
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.21682 0.55421 0.391 0.69563
## sexM 0.67755 0.24420 2.775 0.00553 **
## internetyes 0.29328 0.33791 0.868 0.38544
## studytime -0.43538 0.15902 -2.738 0.00618 **
## G3 -0.07324 0.03582 -2.045 0.04087 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 436.00 on 377 degrees of freedom
## AIC: 446
##
## Number of Fisher Scoring iterations: 4
The coefficients of the model
## (Intercept) sexM internetyes studytime G3
## 0.21682330 0.67754602 0.29328036 -0.43538191 -0.07323889
The computational target variable in the logistic regression model is the log of odds. That is why the exponents of the coefficients of a logistic regression model can be interpreted as odds ratios between a unit change (vs. no change) in the corresponding explanatory variable.
## (Intercept) sexM internetyes studytime G3
## 1.2421246 1.9690398 1.3408186 0.6470175 0.9293788
## Waiting for profiling to be done...
Let’s print out the odds ratios bound together with their confidence intervals!
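A sketch of this computation, assuming the fitted model object m from above (confint() prints the “Waiting for profiling to be done…” message seen below):

```r
# odds ratios are the exponentiated coefficients
OR <- exp(coef(m))
# profile-likelihood confidence intervals, also on the odds-ratio scale
CI <- exp(confint(m))
cbind(OR, CI)
```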
## OR 2.5 % 97.5 %
## (Intercept) 1.2421246 0.4173118 3.6917548
## sexM 1.9690398 1.2229494 3.1911181
## internetyes 1.3408186 0.7044930 2.6702180
## studytime 0.6470175 0.4694243 0.8771147
## G3 0.9293788 0.8656554 0.9965402
From these confidence intervals we can conclude the following.
The reasons why the confidence interval of the sex coefficient (the estimate of the sex vs. alcohol-use relationship) is the widest could include the following:
Of my chosen variables, sex and studytime had a statistically significant relationship with low/high alcohol consumption. Next we’ll explore the predictive power of the created model.
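The prediction step can be sketched like this, assuming the model m from above (the column names probability and prediction are my own; adding them explains why the data now has 37 columns):

```r
library(dplyr)

# predicted probabilities of high alcohol use
probabilities <- predict(m, type = "response")
# add the probabilities and a TRUE/FALSE prediction as new columns
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
# cross-tabulate the actual values against the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
```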
## prediction
## high_use FALSE
## FALSE 268
## TRUE 114
## [1] 382 37
## [1] 70.15707
## prediction
## high_use FALSE Sum
## FALSE 0.7015707 0.7015707
## TRUE 0.2984293 0.2984293
## Sum 1.0000000 1.0000000
Conclusions
The install.packages() and library() functions were used to load the MASS package into R. The dplyr and corrplot packages, which are also used in this exercise, were installed and loaded the same way.
Below is the structure of the Housing Values in Suburbs of Boston dataset. The following characteristics of the data frame can be discerned.
## [1] 506 14
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Here we’ll print out the summary of the data with the summary() function to get a grasp of the min, max, median, mean and quantiles of the data.
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
First we’ll produce a matrix plot with base R’s pairs() function.
Now it’s time to get a picture of the correlations with the cor() function. Here the correlations were rounded to two decimals to save space.
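A minimal sketch of this step (the object name cor_matrix is my own):

```r
# correlation matrix of the Boston data, rounded to two decimals
cor_matrix <- round(cor(Boston), digits = 2)
cor_matrix
```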
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
Here is the visualization of the correlation matrix with the corrplot() function. To reduce repetition, we’ll visualize only the upper triangle of the plot (the correlation matrix is symmetric, so the upper triangle contains the same correlations as the lower one).
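A sketch of the call, assuming a cor_matrix object holding the correlations above (the styling arguments are my own choices):

```r
library(corrplot)

# type = "upper" draws only the upper triangle;
# cl.pos/tl.pos/tl.cex adjust the color legend and text labels
corrplot(cor_matrix, method = "circle", type = "upper",
         cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
```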
The above summaries and visualizations showed the following.
The moderate negative correlation between crime and the black population variable was a small surprise, because the media usually depicts these two as having a strong correlative, even causal, relationship.
Next we’ll center and standardize the variables.
boston_scaled <- scale(Boston)
Let’s look at the summaries of the scaled variables and see how they changed (e.g. the means are now all zero).
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Here are the standard deviations of the variables.
Next we’ll create a categorical dataset of the crime rate variable in the scaled Boston dataset using quantiles as the break points in the categorical variable.
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# save the scaled crim as scaled_crim
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
class(boston_scaled)
## [1] "data.frame"
scaled_crim <- boston_scaled$crim
Let’s take a look at the summary of the scaled crim variable and then create a quantile vector of it. After that we’ll create a vector for the label names before cutting the scaled variable into bins.
summary(scaled_crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419400 -0.410600 -0.390300 0.000000 0.007389 9.924000
The scaled crime variable has the largest range of all the scaled variables (from -0.42 to 9.92). The median is considerably lower than the mean, which indicates that a few observations have much higher crime values than the majority.
bins <- quantile(scaled_crim)
range <- c("low","med_low","med_high","high")
crime <- cut(scaled_crim, breaks = bins, labels = range, include.lowest = TRUE)
Now we can remove the original crim variable from the scaled dataset and add the categorical value to the dataset.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# Here we'll divide the data in two parts: the train (80 %) and the test (20 %) sets
# first we'll get the numbers of rows and save it for the next step
n <- nrow(boston_scaled)
# Then we'll choose randomly 80 % of the data and create a train set
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
# Creating a test set (i.e. choosing the 20 % data that was not chosen by the sample() function) is based on what we did before
test <- boston_scaled[-ind,]
# Save the correct classes from the test data and remove the crime variable from it
# (these two lines belong to the prediction step further below):
#correct_classes <- test$crime
#test <- dplyr::select(test, -crime)
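The LDA fit producing the output below can be sketched as (the object name lda.fit is my own):

```r
library(MASS)

# linear discriminant analysis with crime as the target
# and all other variables in the train set as predictors
lda.fit <- lda(crime ~ ., data = train)
lda.fit
```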
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2425743 0.2425743 0.2599010
##
## Group means:
## zn indus chas nox rm
## low 0.94144265 -0.8707670 -0.081207697 -0.8579261 0.4248286
## med_low -0.05278104 -0.3832942 -0.031282114 -0.6392399 -0.1015470
## med_high -0.38135991 0.1792400 0.129415854 0.3590947 0.1953749
## high -0.48724019 1.0170492 -0.009855719 0.9981806 -0.3949399
## age dis rad tax ptratio
## low -0.8811389 0.7940831 -0.6819317 -0.7449845 -0.44988670
## med_low -0.4237096 0.4346806 -0.5459225 -0.5079267 -0.06335783
## med_high 0.4007338 -0.3949203 -0.4345915 -0.3239910 -0.28158456
## high 0.8360096 -0.8661984 1.6388211 1.5145512 0.78158339
## black lstat medv
## low 0.3803067 -0.77837210 0.54346038
## med_low 0.3558256 -0.17773068 0.01851179
## med_high 0.1173999 -0.08394996 0.24429271
## high -0.7821223 0.88659403 -0.64973316
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11015641 0.58340371 -1.08368702
## indus 0.05055479 -0.24413243 0.14962691
## chas -0.08821139 0.01920799 0.15844382
## nox 0.43993259 -0.86693776 -1.13806994
## rm -0.11241212 -0.14659654 -0.07635778
## age 0.16035057 -0.33446075 -0.06596761
## dis -0.08492826 -0.22244758 0.42216149
## rad 3.30264583 1.12322574 -0.29838195
## tax 0.09456941 -0.13108943 0.73184566
## ptratio 0.13148535 -0.02501289 -0.24165533
## black -0.14109981 0.06083228 0.15355373
## lstat 0.24472932 -0.18925980 0.51365275
## medv 0.21800766 -0.40020482 -0.09946797
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9546 0.0336 0.0119
The biplot of the LDA results shows the following.
After the crime categories have been saved for later use and the crime variable has been removed from the test dataset like this: dplyr::select(test, -crime), it’s time to predict the classes on the test data with the LDA model.
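A sketch of the prediction and its cross-tabulation, assuming the lda.fit model and the saved correct_classes from above:

```r
# predict the crime classes of the test observations
lda.pred <- predict(lda.fit, newdata = test)
# cross-tabulate the correct classes against the predicted ones
table(correct = correct_classes, predicted = lda.pred$class)
```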
## predicted
## correct low med_low med_high high
## low 15 7 2 0
## med_low 2 16 10 0
## med_high 0 11 15 2
## high 0 0 0 22
The prediction of the classes with the LDA model gave the following results, listed in order of prediction success.
All in all, it can be summarized that the LDA model predicted the classes moderately well.
After reloading the Boston data, the next task is to calculate the distances between the data points with the dist() function (which stores the distances in a distance matrix). Below we see the summaries and heads (the first six distances in the matrix) of two types of distances: 1) Euclidean and 2) Manhattan.
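A sketch of this step (reloading, rescaling, then computing both distance matrices):

```r
library(MASS)

data("Boston")
boston_scaled <- scale(Boston)

# Euclidean distance matrix and its summary
dist_eu <- dist(boston_scaled, method = "euclidean")
summary(dist_eu)

# Manhattan distance matrix and its summary
dist_man <- dist(boston_scaled, method = "manhattan")
summary(dist_man)
```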
Euclidean distances
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
## [1] 1.935677 2.280929 2.574117 2.690602 2.240723 1.784194
Manhattan distances
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4830 12.6100 13.5500 17.7600 48.8600
## [1] 5.619451 6.424933 7.283815 7.244650 5.670673 4.497673
We use k-means to produce the clusters and then plot the result.
The main goal is to find the optimal number of clusters. This can be done by trying different values for the centers argument of kmeans(). In practice, plain trial and error is a difficult path; the task becomes easier by computing the within-cluster sum of squares (WCSS) over a range of cluster counts and plotting it. The point where the WCSS drops radically indicates the optimal number of clusters.
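The WCSS curve can be sketched like this, assuming the scaled Boston data from above (k_max = 10 is my own choice):

```r
set.seed(123)  # kmeans picks random starting centers, so fix the seed for reproducibility

k_max <- 10
# total within-cluster sum of squares for k = 1..k_max
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled, k)$tot.withinss)
# plot WCSS against the number of clusters; look for the "elbow"
plot(1:k_max, twcss, type = "b")
```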
The radical drop in the WCSS visualization occurred at 2, so below the matrix-with-clusters visualization is replotted with the optimal number of clusters (2).
Below is the structure of the human dataset. The following characteristics of the data frame can be discerned.
## [1] 155 8
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
Here we’ll print out the summary of the data with the summary() function to get a grasp of the min, max, median, mean and quantiles of the data.
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Here are the standard deviations of the variables.
## sd(Edu2.FM) sd(Labo.FM) sd(Edu.Exp) sd(Life.Exp) sd(GNI) sd(Mat.Mor)
## 1 0.2416396 0.1987786 2.840251 8.332064 18543.85 211.7896
## sd(Ado.Birth) sd(Parli.F)
## 1 41.11205 11.48775
First we’ll produce a matrix plot with base R’s pairs() function.
Now it’s time to get a picture of the correlations with the cor() function. Here the correlations were rounded to two decimals to save space.
## Edu2.FM Labo.FM Edu.Exp Life.Exp GNI Mat.Mor Ado.Birth Parli.F
## Edu2.FM 1.00 0.01 0.59 0.58 0.43 -0.66 -0.53 0.08
## Labo.FM 0.01 1.00 0.05 -0.14 -0.02 0.24 0.12 0.25
## Edu.Exp 0.59 0.05 1.00 0.79 0.62 -0.74 -0.70 0.21
## Life.Exp 0.58 -0.14 0.79 1.00 0.63 -0.86 -0.73 0.17
## GNI 0.43 -0.02 0.62 0.63 1.00 -0.50 -0.56 0.09
## Mat.Mor -0.66 0.24 -0.74 -0.86 -0.50 1.00 0.76 -0.09
## Ado.Birth -0.53 0.12 -0.70 -0.73 -0.56 0.76 1.00 -0.07
## Parli.F 0.08 0.25 0.21 0.17 0.09 -0.09 -0.07 1.00
Here is the visualization of the correlation matrix with the corrplot() function. To reduce repetition, we’ll visualize only the upper triangle of the plot (the correlation matrix is symmetric, so the upper triangle contains the same correlations as the lower one).
The above summaries and visualizations showed the following.
Next we’ll perform principal component analysis (PCA) on the unstandardized human data and show the variability captured by the principal components.
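A sketch of the PCA and the variance percentages, assuming the human data frame from above (the object names are my own):

```r
# PCA with the SVD method on the unstandardized data
pca_human <- prcomp(human)
s <- summary(pca_human)
# row 2 of the importance table is the proportion of variance per component
round(100 * s$importance[2, ], digits = 1)
```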
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
Then we’ll draw a biplot displaying the observations by the first two principal components (PC1 on the x-axis, PC2 on the y-axis), along with arrows representing the original variables.
Next we’ll standardize the variables in the human data and repeat the above analysis, then interpret the results of both analyses (with and without standardizing): are the results different, and why or why not? The plot captions describe the results using not just the variable names but the actual phenomena they relate to.
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
## num [1:155, 1:8] 0.639 0.596 0.54 0.562 0.481 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:155] "Norway" "Australia" "Switzerland" "Denmark" ...
## ..$ : chr [1:8] "Edu2.FM" "Labo.FM" "Edu.Exp" "Life.Exp" ...
## - attr(*, "scaled:center")= Named num [1:8] 8.53e-01 7.07e-01 1.32e+01 7.17e+01 1.76e+04 ...
## ..- attr(*, "names")= chr [1:8] "Edu2.FM" "Labo.FM" "Edu.Exp" "Life.Exp" ...
## - attr(*, "scaled:scale")= Named num [1:8] 2.42e-01 1.99e-01 2.84 8.33 1.85e+04 ...
## ..- attr(*, "names")= chr [1:8] "Edu2.FM" "Labo.FM" "Edu.Exp" "Life.Exp" ...
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
Let’s interpret the results of both analyses (with and without standardizing). Are the results different? Why or why not? The captions in the plots describe the results using not just the variable names but the actual phenomena they relate to.
Here are my personal interpretations of the first two principal component dimensions, based on the biplot drawn after PCA on the standardized human data.
Next we’ll load the tea dataset from the FactoMineR package. We’ll explore the data briefly (its structure and dimensions, with visualizations) and then perform Multiple Correspondence Analysis on a selection of its columns. Finally we’ll interpret the results of the MCA, draw the variable biplot of the analysis, explore other plotting options for MCA and comment on the output of the plots.
# Install FactoMineR (if needed) and load the tea dataset
#install.packages("FactoMineR")
library(FactoMineR)
data(tea)
Let’s explore the data briefly: look at the structure and the dimensions of the data and visualize it.
## [1] "Tea" "How" "how" "sugar" "where" "lunch"
## [1] 300 36
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
## [1] "breakfast" "tea.time" "evening"
## [4] "lunch" "dinner" "always"
## [7] "home" "work" "tearoom"
## [10] "friends" "resto" "pub"
## [13] "Tea" "How" "sugar"
## [16] "how" "where" "price"
## [19] "age" "sex" "SPC"
## [22] "Sport" "age_Q" "frequency"
## [25] "escape.exoticism" "spirituality" "healthy"
## [28] "diuretic" "friendliness" "iron.absorption"
## [31] "feminine" "sophisticated" "slimming"
## [34] "exciting" "relaxing" "effect.on.health"
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## Warning: attributes are not identical across measure variables; they will
## be dropped
Here’s a summary of the tea data’s variables and the distribution of observations across their levels.
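A minimal sketch of the chunk that could produce this overview — assuming the full data frame is named `tea`, as in the dataset shipped with FactoMineR (the actual chunk is hidden in the knitted output):

```r
# Load the tea questionnaire data bundled with FactoMineR
# (assumption: this is where the data comes from)
library(FactoMineR)
data(tea)

# Summarise every column; factors are reported as counts per level
summary(tea)
```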
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming
## feminine :129 Not.sophisticated: 85 No.slimming:255
## Not.feminine:171 sophisticated :215 slimming : 45
##
##
##
##
##
## exciting relaxing effect.on.health
## exciting :116 No.relaxing:113 effect on health : 66
## No.exciting:184 relaxing :187 No.effect on health:234
##
##
##
##
##
Let’s take a look at the correlations within the tea dataset using cor(), which produces a correlation matrix, and corrplot(), which visualizes that matrix.
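Since every variable in `tea_time` is a factor, cor() cannot be applied to it directly. One way to get a rough correlation screen — my assumption here, not necessarily the author’s chunk — is to recode the factors as their underlying integer codes first:

```r
library(corrplot)

# Convert factor columns to their underlying integer codes
# (a crude encoding -- only sensible as a rough screening step,
#  since the codes impose an arbitrary ordering on the levels)
tea_num <- data.frame(lapply(tea_time, as.numeric))

# Correlation matrix of the encoded variables
cor_matrix <- cor(tea_num)

# Visualize the correlation matrix, upper triangle only
corrplot(cor_matrix, method = "circle", type = "upper")
```

For genuinely categorical data, the multiple correspondence analysis below is the more appropriate tool.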
Let’s perform a multiple correspondence analysis (MCA) on the selected tea variables.
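The call echoed in the output below suggests a chunk along these lines — a sketch, assuming `tea_time` holds the six columns shown in the earlier str() output:

```r
library(FactoMineR)
library(dplyr)

# Keep the six columns used in the analysis
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep_columns))

# Multiple correspondence analysis without the default plots
mca <- MCA(tea_time, graph = FALSE)

# Eigenvalues, first individuals, categories, and eta2 per variable
summary(mca)
```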
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
## [1] "eig" "call" "ind" "var" "svd"
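The components listed above (`eig`, `call`, `ind`, `var`, `svd`) can be explored further; a sketch of the standard FactoMineR variable biplot, assuming the result object is named `mca`:

```r
# Variable biplot: categories plotted on the first two dimensions,
# colored by the variable they belong to (habillage = "quali");
# individuals are hidden to keep the plot readable
plot(mca, invisible = c("ind"), habillage = "quali")

# The eigenvalues behind the "% of var." rows are stored in mca$eig
head(mca$eig)
```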